Name: Ravisankar Chengannagari
Name: Ashish Bhandari
Air pollution significantly impacts global health, ecosystems, and economies, necessitating vigilant monitoring and predictive analysis. This study addresses this concern by analyzing air quality trends in New York City and globally, with a focus on Nitrogen dioxide (NO2) levels in New York and PM2.5 concentrations worldwide. The research aims to identify historical trends, understand regional disparities, and examine contributing factors to air pollution, thereby informing policy decisions.
Data is sourced from authoritative platforms, including the NYC Open Data platform and the World Health Organization, utilizing both API access and direct downloads. The methodology encompasses data collection, thorough cleaning and preprocessing, exploratory data analysis, and advanced statistical techniques such as correlation and regression analysis. Geospatial visualization tools highlight pollution hotspots, facilitating easy comparison across regions.
The study reveals key insights into air quality trends, showcasing regional disparities and identifying urban areas with critical air quality issues. The comparative analysis sheds light on PM2.5 concentrations in New York City relative to other major urban centers, uncovering significant findings relevant for urban planning and environmental policies. This project offers a data-driven foundation for understanding air quality trends and contributes to informed decision-making in environmental management. The outcomes highlight the importance of effective air quality management policies and provide a framework for future research in this domain.
Air pollution is a significant global issue with serious consequences for public health, ecosystems, and economies. As environmental degradation continues to be a concern, the need for effective monitoring and analysis of air quality trends becomes increasingly crucial. This project aims to study Nitrogen dioxide (NO2) levels in New York City and PM2.5 concentrations worldwide, utilizing advanced statistical and machine learning methods to identify trends, understand regional disparities, and evaluate the extent of air pollution. The goal is to provide a solid foundation for data-driven policy-making and to contribute to improving environmental conditions.
The motivation behind this research is driven by a multifaceted concern for the well-being of both people and the planet. Air pollution poses serious threats to public health, leading to respiratory and cardiovascular diseases that strain healthcare systems and diminish quality of life. Understanding and addressing pollutants such as Nitrogen dioxide (NO2) and PM2.5 is critical because of their detrimental health effects.
Moreover, New York City, as a major economic hub, exemplifies how air pollution can impact economic vitality. The city's industrial activity, vehicular emissions, and urbanization contribute to poor air quality, which, in turn, affects workforce productivity and healthcare costs. By examining trends and disparities in air quality, this research aims to inform policies that improve environmental conditions and enhance economic resilience.
From a global perspective, air pollution not only impacts diverse regions but also highlights the need for effective interventions and sustainable urban planning. The project seeks to provide insights that address critical scientific and business concerns, supporting targeted interventions for healthier environments.
In addition to public health and economic concerns, this research is motivated by the desire to enhance environmental management and support businesses in mitigating pollution-related risks. By contributing to corporate social responsibility initiatives and fostering the development of environmental technologies, this study aligns with broader efforts to create sustainable urban environments that benefit both scientific understanding and economic interests.
The research questions formulated for this project aim to provide a comprehensive understanding of air quality trends and factors affecting air pollution, both locally in New York City and globally. Here’s a detailed explanation of each question:
Research Question 1: Trend Analysis
What are the historical trends in air quality within New York and on a global scale, and how do these trends compare over time?
Objective: To ascertain the historical trends in air quality within New York City and globally, and to compare these trends over time to identify patterns of improvement or deterioration.
Rationale: This analysis provides an in-depth look at how air quality has evolved over the years, both locally and internationally. By examining data on NO2 and PM2.5 levels over extended periods, we can pinpoint periods of significant change which may correlate with specific regulatory actions, industrial growth, or urban development phases. This detailed temporal mapping helps identify when and where air quality initiatives have been successful or where they have failed, offering a historical perspective that enhances future air quality forecasting and management strategies.
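The kind of temporal mapping described above can be sketched with pandas. The column names below mirror the NYC Open Data schema used later in this report, but the readings themselves are hypothetical sample values, not real measurements.

```python
import pandas as pd

# Hypothetical NO2 readings; the real analysis uses the
# "Start_Date" and "Data Value" columns from NYC Open Data.
readings = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2010-01-01", "2010-06-01",
                                  "2015-01-01", "2015-06-01"]),
    "Data Value": [30.0, 28.0, 22.0, 20.0],
})

# Annual mean concentration: group by calendar year and average,
# which makes year-over-year improvement or deterioration visible.
annual = readings.groupby(readings["Start_Date"].dt.year)["Data Value"].mean()
print(annual.to_dict())  # {2010: 29.0, 2015: 21.0}
```

The same grouping, applied to the full dataset, yields the year-by-year series compared in the trend plots later in this report.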
Research Question 2: Extremes in Air Quality
Among global regions, which exhibit the highest and lowest levels of air pollution, and what insights can we gain about the factors contributing to these extremes?
Objective: To identify which global regions exhibit the highest and lowest levels of air pollution and to explore the factors contributing to these extremes.
Rationale: Understanding the extremes in air quality across different regions illuminates the most and least polluted areas, prompting a deeper investigation into the environmental, socioeconomic, and policy conditions that lead to such disparities. By studying areas with extreme pollution, we can explore specific local factors like heavy industrialization, low regulations, or geographical and meteorological conditions that contribute to poor air quality. Conversely, regions with exceptionally clean air provide insights into successful environmental practices and regulations, which can serve as models for other areas.
Research Question 3: Local Analysis
Within New York, which neighborhood stands out for having the worst air quality, and what are the potential local contributors to this status?
Objective: To determine which neighborhoods in New York City experience the worst air quality and to investigate potential local contributors to this status.
Rationale: This question delves into the granular impacts of air pollution at the neighborhood level within a major metropolitan area. By identifying the most polluted neighborhoods, the analysis can focus on localized sources of pollution such as traffic congestion, specific industrial activities, or lack of green spaces. Understanding these factors allows for the development of localized interventions that can directly address the sources of pollution in a targeted manner, potentially leading to more effective solutions.
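Identifying the worst-affected neighborhood reduces to a group-and-rank operation. The sketch below uses the dataset's real column names ("Geo Place Name", "Data Value") but entirely hypothetical readings chosen for illustration.

```python
import pandas as pd

# Hypothetical PM2.5 readings per neighborhood; the real analysis
# uses the full NYC Open Data frame loaded later in this report.
df = pd.DataFrame({
    "Geo Place Name": ["Midtown", "Midtown", "Jamaica", "Jamaica"],
    "Data Value": [12.0, 11.0, 8.0, 9.0],
})

# Mean concentration per neighborhood, then pick the highest one.
by_area = df.groupby("Geo Place Name")["Data Value"].mean()
worst = by_area.idxmax()
print(worst, by_area[worst])  # Midtown 11.5
```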
Research Question 4: Comparative Analysis of PM2.5 Concentrations
How do PM2.5 concentrations in New York City compare to other major urban areas around the world?
Objective: To compare PM2.5 concentrations in New York City with those in other major urban areas around the world.
Rationale: This comparison sheds light on how New York City's air quality management strategies measure up against those implemented in other global cities facing similar challenges. By examining PM2.5 levels across different cities, we can assess the relative success of air quality controls and urban planning strategies. This analysis not only highlights areas where New York City may need to bolster its efforts but also provides an opportunity to learn from the successes of other cities. This comparative perspective is essential for adopting best practices and innovative solutions in air quality management.
In this project, we derive data from two critical sources that provide comprehensive and authoritative datasets on air quality, enabling a detailed analysis of both local and global pollution levels. The specific characteristics of each source are elaborated below to emphasize their relevance and reliability in supporting our air quality research.
In our final project, we have adopted a systematic research approach to thoroughly analyze air quality trends in New York City and globally. This approach encompasses several distinct yet interconnected phases: data acquisition, data management, data preparation, exploratory data analysis (EDA), and investigative analysis. Each phase is designed to ensure that our findings are both robust and insightful, supporting effective policy recommendations.
Data Acquisition
Our research project's success critically hinges on the systematic acquisition of high-quality, authoritative data concerning air quality. This section details the meticulous process involved in identifying, selecting, and acquiring the necessary datasets from two primary sources: NYC Open Data and the World Health Organization (WHO). Each step in this process is crafted to ensure that the data not only meets our specific research needs but also adheres to the highest standards of data integrity and reliability.
Initial Data Search:
The search for appropriate data sources began with a comprehensive review of available environmental data repositories that provide open access to air quality measurements. Our criteria for selection included data comprehensiveness, update frequency, geographical specificity, and the reliability of the source. After evaluating several potential sources, NYC Open Data and the WHO were identified as the most suitable for providing the detailed and reliable datasets required for our analysis.
Evaluation of Data Quality and Relevance:
Before finalizing our choice of data sources, we conducted a preliminary assessment of the data quality. This involved reviewing the data collection methodologies used by each source, the frequency of data updates, and the historical depth of the datasets. Both chosen platforms demonstrated robust data collection protocols and provided extensive documentation on their methodologies, ensuring the datasets' relevance and reliability for our study.
Methods of Data Acquisition
NYC Open Data:
Platform Use: Utilizing the NYC Open Data platform involved accessing their comprehensive API, which provides real-time data feeds and historical data access. This API facilitates the integration of live data streams into our analysis tools, enabling up-to-date and longitudinal studies of NO2 levels across New York City.
API Integration: The integration process included setting up API calls tailored to retrieve air quality data specific to our geographic and pollutant criteria. The data is then automatically pulled into our PostgreSQL database, configured to handle large volumes of time-series data efficiently.
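As a rough sketch of what such a tailored API call looks like: NYC Open Data is served through the Socrata (SODA) API, which accepts `$where` filters and row limits as query parameters. The dataset id and filter value below are illustrative assumptions, not the exact calls used in this project.

```python
from urllib.parse import urlencode

def build_soda_url(dataset_id, where, limit=50000):
    """Build a SODA API URL with a $where filter and a row limit."""
    base = f"https://data.cityofnewyork.us/resource/{dataset_id}.json"
    return base + "?" + urlencode({"$where": where, "$limit": limit})

# Hypothetical dataset id and pollutant filter for illustration.
url = build_soda_url("c3uy-2p5r", "name = 'Nitrogen dioxide (NO2)'")
print(url)
```

The JSON response from such a URL can then be loaded directly with `pd.read_json(url)` before being written into the project database.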
World Health Organization (WHO):
Data Compliance and Ethical Considerations
In conducting this research, special attention was given to ensuring compliance with legal standards and ethical guidelines related to data usage. The datasets utilized from NYC Open Data and the World Health Organization (WHO) are publicly available and explicitly provided for research and analysis purposes, thus ensuring legal compliance.
Legal and Open Access Compliance:
NYC Open Data:
World Health Organization (WHO):
Ethical Considerations:
The data acquisition strategy outlined herein reflects a thorough and deliberate approach to sourcing, integrating, and managing critical environmental data. This foundational work is vital for empowering our subsequent exploratory and investigative analyses, ultimately enabling a nuanced understanding of air quality trends and informing effective policy interventions.
Data Management and Storage
In our research project examining air quality trends in New York City and globally, effective data management and storage are crucial to ensure data integrity, facilitate efficient analysis, and support the scalability of the project. This section elaborates on our sophisticated approach to managing and securely storing large volumes of environmental data, utilizing advanced database technologies and adhering to best practices.
Data Storage Infrastructure
Data Management Processes
Data is automatically ingested from the NYC Open Data API and WHO datasets via scripts that run at predetermined intervals. This ensures that our database is consistently updated with the most recent data without manual intervention.
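The ingestion step above can be sketched as an idempotent upsert that a scheduler (e.g. cron) would call at each interval. This sketch uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table and column names are illustrative assumptions.

```python
import sqlite3

def ingest(conn, rows):
    """Insert new readings, silently skipping ids already stored,
    so that rerunning the script never duplicates data."""
    conn.execute("""CREATE TABLE IF NOT EXISTS air_quality
                    (unique_id INTEGER PRIMARY KEY, value REAL)""")
    conn.executemany(
        "INSERT OR IGNORE INTO air_quality VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
ingest(conn, [(1, 0.3), (2, 1.2)])
ingest(conn, [(2, 1.2), (3, 8.6)])   # rerun: duplicate id 2 is skipped
print(conn.execute("SELECT COUNT(*) FROM air_quality").fetchone()[0])  # 3
```

Because duplicate ids are ignored, the scheduled script can safely re-fetch overlapping time windows without corrupting the stored series.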
Use of PostgreSQL for data management in this project provides a strong foundation for handling the complex and sensitive data involved in air quality research. The structured approach to data management and storage ensures that our analysis is supported by data that is not only secure and well-managed but also consistently reliable and accessible. This infrastructure is crucial for delivering accurate, actionable insights into air quality trends and for supporting the broader objectives of environmental health research.
Data Preparation
The data preparation phase of our research project is meticulously designed to ensure the integrity and quality of the data used for our analysis of air quality trends in New York City and globally. This crucial stage sets the groundwork for accurate and reliable results, adhering to high standards. The following details the comprehensive steps undertaken to clean, preprocess, and prepare the datasets obtained from NYC Open Data and the World Health Organization (WHO).
Data Cleaning:
The focused and rigorous data cleaning process employed in this project forms the foundation for all subsequent analyses. By ensuring that our datasets are free from inaccuracies and inconsistencies, we establish a robust base for exploring air quality trends. This meticulous attention to data integrity not only enhances the credibility of our research but also ensures that the insights derived are based on the most reliable data available. As such, our data cleaning efforts are crucial in enabling informed, data-driven conclusions essential for understanding and mitigating air pollution effectively.
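A minimal sketch of these cleaning steps, run on a small synthetic frame: the column names match the NYC Open Data schema used in this report, but the rows are fabricated for illustration only.

```python
import pandas as pd
import numpy as np

# Synthetic raw frame mimicking the NYC dataset's quirks: an
# entirely-null "Message" column, a missing location, string dates.
raw = pd.DataFrame({
    "Geo Place Name": ["Jamaica", None, "Southeast Queens"],
    "Start_Date": ["01/01/2015", "01/01/2015", "12/01/2011"],
    "Data Value": [0.3, 1.2, 8.6],
    "Message": [np.nan, np.nan, np.nan],
})

clean = (raw
         .drop(columns=["Message"])          # drop the entirely-null column
         .dropna(subset=["Geo Place Name"])  # drop rows missing a location
         .assign(Start_Date=lambda d: pd.to_datetime(
             d["Start_Date"], format="%m/%d/%Y")))  # parse dates
print(clean.shape)  # (2, 3)
```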
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a fundamental component of our research project, providing the initial deep dive into the air quality data collected from New York City and globally. EDA helps uncover underlying patterns, identify anomalies, and gain a thorough understanding of the dataset's characteristics, which are essential for guiding further statistical analysis and predictive modeling.
Objectives of EDA
The primary objectives of our EDA are to:
EDA Techniques Employed
Tools and Technologies Used
Quality Assurance in EDA
This groundwork supports more sophisticated analyses and helps us draw meaningful, data-driven conclusions that can inform policy decisions and contribute to the broader discourse on environmental health and air quality management.
In conclusion, our comprehensive research approach, meticulously designed across multiple phases—Data Acquisition, Data Preparation, Exploratory Data Analysis, and Investigative Analysis—ensures the integrity and depth of our analysis of air quality trends in New York City and globally. This methodological rigor facilitates a seamless transition from data collection to deep analytical insights, enabling us to address complex environmental questions with precision. Through this structured progression, we uphold stringent academic and ethical standards, ensuring that our findings are not only scientifically robust but also of practical relevance to policy making and public health. Ultimately, our research approach is aimed at providing actionable insights that can significantly impact environmental management and urban planning in the context of air quality improvement.
#imported the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import psycopg2
import http.client
import folium
import plotly.graph_objs as go
import plotly.offline as pyo
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
import warnings
warnings.filterwarnings('ignore')
#read csv file which is stored in github repository
data = pd.read_csv('https://raw.githubusercontent.com/ravi2248/AIM-5001/main/Air_Quality.csv')
#we can see the top 5 rows
data.head()
| | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | Data Value | Message |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 179772 | 640 | Boiler Emissions- Total SO2 Emissions | Number per km2 | number | UHF42 | 409.0 | Southeast Queens | 2015 | 01/01/2015 | 0.3 | NaN |
| 1 | 179785 | 640 | Boiler Emissions- Total SO2 Emissions | Number per km2 | number | UHF42 | 209.0 | Bensonhurst - Bay Ridge | 2015 | 01/01/2015 | 1.2 | NaN |
| 2 | 178540 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | UHF42 | 209.0 | Bensonhurst - Bay Ridge | Annual Average 2012 | 12/01/2011 | 8.6 | NaN |
| 3 | 178561 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | UHF42 | 409.0 | Southeast Queens | Annual Average 2012 | 12/01/2011 | 8.0 | NaN |
| 4 | 823217 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | UHF42 | 409.0 | Southeast Queens | Summer 2022 | 06/01/2022 | 6.1 | NaN |
#we can see the bottom 5 rows
data.tail()
| | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | Data Value | Message |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18020 | 816914 | 643 | Annual vehicle miles traveled | Million miles | per square mile | CD | 503.0 | Tottenville and Great Kills (CD3) | 2019 | 01/01/2019 | 12.9 | NaN |
| 18021 | 816913 | 643 | Annual vehicle miles traveled | Million miles | per square mile | CD | 503.0 | Tottenville and Great Kills (CD3) | 2010 | 01/01/2010 | 14.7 | NaN |
| 18022 | 816872 | 643 | Annual vehicle miles traveled | Million miles | per square mile | UHF42 | 208.0 | Canarsie - Flatlands | 2010 | 01/01/2010 | 43.4 | NaN |
| 18023 | 816832 | 643 | Annual vehicle miles traveled | Million miles | per square mile | UHF42 | 407.0 | Southwest Queens | 2010 | 01/01/2010 | 65.8 | NaN |
| 18024 | 151658 | 643 | Annual vehicle miles traveled | Million miles | per square mile | UHF42 | 408.0 | Jamaica | 2005 | 01/01/2005 | 41.0 | NaN |
#it will show the number of columns and rows in the dataset
data.shape
(18025, 12)
#we can see the columns details
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18025 entries, 0 to 18024
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Unique ID       18025 non-null  int64
 1   Indicator ID    18025 non-null  int64
 2   Name            18025 non-null  object
 3   Measure         18025 non-null  object
 4   Measure Info    18025 non-null  object
 5   Geo Type Name   18025 non-null  object
 6   Geo Join ID     18016 non-null  float64
 7   Geo Place Name  18016 non-null  object
 8   Time Period     18025 non-null  object
 9   Start_Date      18025 non-null  object
 10  Data Value      18025 non-null  float64
 11  Message         0 non-null      float64
dtypes: float64(3), int64(2), object(7)
memory usage: 1.7+ MB
#here we are connecting to PostgreSQL
#to read the data present in the database
#the password is read from an environment variable (PG_PASSWORD here)
#rather than being hardcoded in the notebook
import os
con = psycopg2.connect(
    dbname = 'airquality',
    user = 'postgres',
    password = os.environ.get('PG_PASSWORD'),
    host = 'localhost',
    port = '5432'
)
#this query returns all the rows stored in PostgreSQL
query = "SELECT * FROM air_quality"
#we use pandas' read_sql to run the query and store the result in a DataFrame
global_data = pd.read_sql(query, con)
#we close the connection once the data is loaded
con.close()
#we can see the top 5 rows
global_data.head()
| | indicatorcode | indicator | parentlocationcode | parentlocation | locationtype | spatialdimvaluecode | location | periodtype | period | islatestyear | dim1type | dim1 | dim1valuecode | factvaluenumeric | factvaluenumericlow | factvaluenumerichigh | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | AFR | Africa | Country | KEN | Kenya | Year | 2019 | True | Residence Area Type | Cities | RESIDENCEAREATYPE_CITY | 10.01 | 6.29 | 13.74 | 10.01 [6.29-13.74] |
| 1 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | AMR | Americas | Country | TTO | Trinidad and Tobago | Year | 2019 | True | Residence Area Type | Rural | RESIDENCEAREATYPE_RUR | 10.02 | 7.44 | 12.55 | 10.02 [7.44-12.55] |
| 2 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | EUR | Europe | Country | GBR | United Kingdom of Great Britain and Northern I... | Year | 2019 | True | Residence Area Type | Cities | RESIDENCEAREATYPE_CITY | 10.06 | 9.73 | 10.39 | 10.06 [9.73-10.39] |
| 3 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | AMR | Americas | Country | GRD | Grenada | Year | 2019 | True | Residence Area Type | Total | RESIDENCEAREATYPE_TOTL | 10.08 | 7.07 | 13.20 | 10.08 [7.07-13.20] |
| 4 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | AMR | Americas | Country | BRA | Brazil | Year | 2019 | True | Residence Area Type | Towns | RESIDENCEAREATYPE_TOWN | 10.09 | 8.23 | 12.46 | 10.09 [8.23-12.46] |
#we can see the bottom 5 rows
global_data.tail()
| | indicatorcode | indicator | parentlocationcode | parentlocation | locationtype | spatialdimvaluecode | location | periodtype | period | islatestyear | dim1type | dim1 | dim1valuecode | factvaluenumeric | factvaluenumericlow | factvaluenumerichigh | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9445 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | AMR | Americas | Country | BLZ | Belize | Year | 2010 | False | Residence Area Type | Cities | RESIDENCEAREATYPE_CITY | 9.92 | 3.91 | 20.28 | 9.92 [3.91-20.28] |
| 9446 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | AMR | Americas | Country | TTO | Trinidad and Tobago | Year | 2010 | False | Residence Area Type | Cities | RESIDENCEAREATYPE_CITY | 9.92 | 7.80 | 12.89 | 9.92 [7.80-12.89] |
| 9447 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | AFR | Africa | Country | KEN | Kenya | Year | 2010 | False | Residence Area Type | Cities | RESIDENCEAREATYPE_CITY | 9.94 | 6.30 | 13.57 | 9.94 [6.30-13.57] |
| 9448 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | AMR | Americas | Country | USA | United States of America | Year | 2010 | False | Residence Area Type | Cities | RESIDENCEAREATYPE_CITY | 9.95 | 9.78 | 10.11 | 9.95 [9.78-10.11] |
| 9449 | SDGPM25 | Concentrations of fine particulate matter (PM2.5) | EMR | Eastern Mediterranean | Country | AFG | Afghanistan | Year | 2010 | False | Residence Area Type | Cities | RESIDENCEAREATYPE_CITY | 92.79 | 66.17 | 128.40 | 92.79 [66.17-128.44] |
#it will show the number of columns and rows in the dataset
global_data.shape
(9450, 17)
#we can see the columns details
global_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9450 entries, 0 to 9449
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   indicatorcode         9450 non-null   object
 1   indicator             9450 non-null   object
 2   parentlocationcode    9450 non-null   object
 3   parentlocation        9450 non-null   object
 4   locationtype          9450 non-null   object
 5   spatialdimvaluecode   9450 non-null   object
 6   location              9450 non-null   object
 7   periodtype            9450 non-null   object
 8   period                9450 non-null   object
 9   islatestyear          9450 non-null   bool
 10  dim1type              9450 non-null   object
 11  dim1                  9450 non-null   object
 12  dim1valuecode         9450 non-null   object
 13  factvaluenumeric      9450 non-null   float64
 14  factvaluenumericlow   9450 non-null   float64
 15  factvaluenumerichigh  9450 non-null   float64
 16  value                 9450 non-null   object
dtypes: bool(1), float64(3), object(13)
memory usage: 1.2+ MB
To explore the data, we created a variety of plots (bar plots, line plots, and so on) and used the folium and plotly libraries to build maps for analysing trends in the data. The plots and their results are discussed below.
#Here we can see the unique names of indicators and their frequency in the dataset
data['Name'].value_counts()
Name
Nitrogen dioxide (NO2)                                    5922
Fine particles (PM 2.5)                                   5922
Ozone (O3)                                                2115
Asthma emergency departments visits due to Ozone           485
Asthma hospitalizations due to Ozone                       484
Asthma emergency department visits due to PM2.5            480
Annual vehicle miles traveled (cars)                       321
Annual vehicle miles traveled                              321
Annual vehicle miles traveled (trucks)                     321
Respiratory hospitalizations due to PM2.5 (age 20+)        240
Cardiovascular hospitalizations due to PM2.5 (age 40+)     240
Cardiac and respiratory deaths due to Ozone                240
Deaths due to PM2.5                                        240
Outdoor Air Toxics - Benzene                               203
Outdoor Air Toxics - Formaldehyde                          203
Boiler Emissions- Total SO2 Emissions                       96
Boiler Emissions- Total NOx Emissions                       96
Boiler Emissions- Total PM2.5 Emissions                     96
Name: count, dtype: int64
#It will plot a bar graph
#we can see the names of indicators and their frequencies
plt.figure(figsize = (10, 8))
data['Name'].value_counts().plot(kind = 'bar', color = 'lightgreen')
plt.xlabel('Name')
plt.ylabel('Frequency')
plt.xticks(rotation = 45, ha = 'right')
plt.tight_layout()
plt.show()
Discussion of result:
From the bar graph above, we can see the indicators and their frequencies. Nitrogen dioxide (NO2) (5922), Fine particles (PM 2.5) (5922), and Ozone (O3) (2115) have the highest frequencies in the dataset, while Boiler Emissions - Total SO2 Emissions (96), Total NOx Emissions (96), and Total PM2.5 Emissions (96) have the lowest.
#Here we can see the frequencies of type of geographic data
data['Geo Type Name'].value_counts()
Geo Type Name
UHF42       7140
CD          6490
UHF34       3366
Borough      859
Citywide     170
Name: count, dtype: int64
#It will plot a bar graph
#we can see the types of geographic areas and their frequencies
plt.figure(figsize = (10, 6))
data['Geo Type Name'].value_counts().plot(kind = 'bar', color = 'skyblue')
plt.xlabel('Geo Type Name')
plt.ylabel('Frequency')
plt.xticks(rotation = 45, ha = 'right')
plt.tight_layout()
plt.show()
Discussion of Result:
From the graph above, we can see the types of geographic areas and their frequencies. UHF42 (7140) and CD (6490) are the most frequent, while Citywide (170) is the least frequent in the dataset.
#Here we are getting the data of the indicator named Fine particles (PM 2.5)
#(.copy() gives us an independent frame, avoiding pandas' SettingWithCopyWarning below)
pm2_5_data = data[data['Name'] == 'Fine particles (PM 2.5)'].copy()
#Here we are changing the data type to datetime
pm2_5_data['Start_Date'] = pd.to_datetime(pm2_5_data['Start_Date'])
#It will plot PM 2.5 concentration over time
#on x-axis we can see time and on y-axis we can see PM 2.5 concentration
plt.figure(figsize = (10, 6))
plt.plot(pm2_5_data['Start_Date'], pm2_5_data['Data Value'], marker = 'o', linestyle = '', color = 'orange')
plt.title('Trend of PM2.5 Concentration Over Time')
plt.xlabel('Time')
plt.ylabel('PM2.5 Concentration (mcg/m3)')
plt.xticks(rotation = 45)
plt.grid(True)
plt.tight_layout()
plt.show()
Discussion of Result:
From the graph above, we can see the trend of PM 2.5 concentrations over time. Some dates show high PM 2.5 concentrations and others show lower values; the concentrations fluctuate up and down over time.
#it shows a line plot
#we can see new york city air quality trends over time
plt.figure(figsize = (12, 6))
sns.lineplot(data = data, x = "Start_Date", y = "Data Value", hue = "Name", style = "Name", markers = True, dashes = False)
plt.title("New York City Air Quality Trends Over Time")
plt.xlabel("Date")
plt.ylabel("Data Value")
plt.xticks(rotation = 90)
plt.legend(title = "Indicator", bbox_to_anchor = (1, 1), loc = 'upper left')
plt.tight_layout()
plt.show()
Discussion of Result:
From the graph above, we can see the New York City air quality trends over time for the different pollutants. The PM 2.5 series (shown in orange) runs from 2008 to 2022, with some minor ups and downs in concentration over that period.
#it will plot a boxplot graph
#it shows distribution of air quality across countries
plt.figure(figsize = (12, 6))
sns.boxplot(data = global_data, x = "parentlocation", y = "factvaluenumeric")
plt.title("Distribution of Air Quality Indicator Across Countries")
plt.xlabel("Country")
plt.ylabel("Data Value")
plt.xticks(rotation = 45)
plt.tight_layout()
plt.show()
Discussion of Result:
From the graph above, we can see the distribution of air quality values grouped by the countries' parent locations: Africa, the Americas, Europe, the Western Pacific, South-East Asia, and the Eastern Mediterranean. The Eastern Mediterranean countries show the highest values.
#it plots a barplot
#it shows comparison of PM 2.5 concentrations across major urban areas
plt.figure(figsize = (14, 8))
sns.barplot(data = data[data["Name"] == "Fine particles (PM 2.5)"], x = "Geo Place Name", y = "Data Value", ci = None)
plt.title("Comparison of PM 2.5 concentrations across major Urban Areas")
plt.xlabel("Urban Area")
plt.ylabel("Mean PM2.5 Concentration")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
Discussion of Result:
From the graph above, we can see the comparison of PM 2.5 concentrations across major urban areas. The concentrations vary considerably from one area to another, meaning the neighbourhoods have quite different PM 2.5 levels.
#it shows a line plot
#we can see global air quality trends over time
plt.figure(figsize = (12, 6))
sns.lineplot(data = global_data, x = "period", y = "factvaluenumeric", hue = "parentlocation", style = "parentlocation", markers = True, dashes = False)
plt.title("Global Air Quality Trends over time")
plt.xlabel("Year")
plt.ylabel("Data Value")
plt.xticks(rotation = 45)
plt.legend(title = "Parent Location", bbox_to_anchor = (1, 1), loc = 'upper left')
plt.tight_layout()
plt.show()
Discussion of Result:
From the graph above, we can see the global air quality trends over time, covering 2010 to 2019. The Eastern Mediterranean has the highest PM 2.5 concentration values and the Americas the lowest; looking closely, the curve for the Americas shows very little change over the period.
#it shows a bar plot
#it shows maximum air quality indicators across global regions
plt.figure(figsize = (10, 6))
max_values = global_data.groupby("parentlocation")["factvaluenumeric"].max().sort_values(ascending=False)
sns.barplot(x = max_values.values, y = max_values.index, palette = "viridis")
plt.title("Maximum Air Quality Indicators Across Global Regions")
plt.xlabel("Maximum Data Value")
plt.ylabel("Region")
plt.tight_layout()
plt.show()
Discussion of Result:
From the graph above, we can clearly see the maximum air quality values grouped by the countries' parent locations: Africa, the Americas, Europe, the Western Pacific, South-East Asia, and the Eastern Mediterranean. The Eastern Mediterranean countries show the highest value.
#here we fetch the values of countries and their PM2.5 concentration values
PM25_map_data = {
'Country': global_data['location'],
'Value': global_data['factvaluenumeric']
}
#we converted the data into dataframe
PM25_map_data = pd.DataFrame(PM25_map_data)
# Create choropleth map
map_data = dict(
    type = 'choropleth',
    locations = PM25_map_data['Country'],
    locationmode = 'country names',
    z = PM25_map_data['Value'],
    text = PM25_map_data['Country'],
    colorscale = 'Blues',
    colorbar = {'title' : 'Value'}
)
layout = dict(
    title = 'PM2.5 Concentrations of global data',
    geo = dict(showframe = False, projection = {'type':'mercator'})
)
choromap = go.Figure(data = [map_data],layout = layout)
choromap
Discussion of Result:
The choropleth shows PM2.5 concentrations by country. Hovering over a country displays its value; for example, hovering over the United States of America shows a data value of 6.42. The same applies to every country on the map.
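One caveat with this map: `global_data` can contain several rows per country (for example, different years or residence-area dimensions), while a choropleth expects one value per location. A minimal sketch of aggregating to one value per country before plotting (the numbers below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy slice of global_data: a country can appear several times
# (different years or residence-area dimensions); values are illustrative.
df = pd.DataFrame({
    "location": ["United States of America", "United States of America", "India"],
    "factvaluenumeric": [6.0, 6.84, 55.0],
})

# Averaging per country yields one value per polygon on the map.
per_country = df.groupby("location")["factvaluenumeric"].mean()
print(round(per_country["United States of America"], 2))  # → 6.42
```

The resulting Series can then feed `locations` and `z` in the choropleth so each country is drawn once.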
#Here we connect to the Ambee API, which provides current air-quality readings for NYC
conn = http.client.HTTPSConnection("api.ambeedata.com")
#note: the API key is hardcoded here for the demo; in practice it should be loaded from an environment variable or secrets store
headers = {
    'x-api-key': "933a144f6ff91631a9dea379dc86e5d83e7454f735c4ed1d494f7acb7f07253f",
    'Content-type': "application/json"
}
conn.request("GET", "/latest/by-lat-lng?lat=40.730610&lng=-73.935242", headers=headers)
res = conn.getresponse()
api_data = res.read()
#decode the raw response bytes into a UTF-8 string
api_data = api_data.decode('utf-8')
#we can see the data
api_data
'{"message":"success","stations":[{"CO":0.531,"NO2":27.06,"OZONE":20.102,"PM10":49.87,"PM25":11.196,"SO2":0.564,"city":"New York","countryCode":"US","division":"New York","lat":40.7139,"lng":-74.007,"placeName":"Broadway","postalCode":"10007-0052","state":"New York","updatedAt":"2024-05-07T02:00:00.000Z","AQI":47,"aqiInfo":{"pollutant":"PM2.5","concentration":11.196,"category":"Good"}}]}'
#here we are importing the json library
import json
#parse the JSON string into a Python dictionary
api_data = json.loads(api_data)
#here we can see it
api_data
{'message': 'success',
'stations': [{'CO': 0.531,
'NO2': 27.06,
'OZONE': 20.102,
'PM10': 49.87,
'PM25': 11.196,
'SO2': 0.564,
'city': 'New York',
'countryCode': 'US',
'division': 'New York',
'lat': 40.7139,
'lng': -74.007,
'placeName': 'Broadway',
'postalCode': '10007-0052',
'state': 'New York',
'updatedAt': '2024-05-07T02:00:00.000Z',
'AQI': 47,
'aqiInfo': {'pollutant': 'PM2.5',
'concentration': 11.196,
'category': 'Good'}}]}
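Indexing straight into `api_data['stations'][0]`, as below, will raise a `KeyError` or `IndexError` if the API ever returns an error payload. A small defensive parser, sketched against the sample response above (the function name is ours, not part of the Ambee API):

```python
import json

def parse_first_station(raw: str) -> dict:
    """Parse an Ambee-style JSON payload and return the first station's
    readings, failing loudly if the shape is unexpected."""
    payload = json.loads(raw)
    if payload.get("message") != "success" or not payload.get("stations"):
        raise ValueError(f"unexpected API response: {payload.get('message')!r}")
    return payload["stations"][0]

# Trimmed sample of the response shown above.
sample = ('{"message":"success","stations":[{"PM25":11.196,'
          '"lat":40.7139,"lng":-74.007,"AQI":47}]}')
station = parse_first_station(sample)
print(station["PM25"], station["AQI"])  # → 11.196 47
```

With rate-limited API calls, failing with a clear message beats a bare traceback deep inside the indexing code.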
#From the API response we extract the latitude, longitude, and PM2.5 values
ny_present_data = {
    'Latitude': [api_data['stations'][0]['lat']],
    'Longitude': [api_data['stations'][0]['lng']],
    'PM2.5': [api_data['stations'][0]['PM25']]
}
#Here we converted into dataframe
ny_present_data = pd.DataFrame(ny_present_data)
#it will create a map centered around NYC
nyc_map = folium.Map(location = [api_data['stations'][0]['lat'], api_data['stations'][0]['lng']], zoom_start = 10)
#add a circle marker for each data point
for index, row in ny_present_data.iterrows():
    folium.CircleMarker(
        location = [row['Latitude'], row['Longitude']],
        radius = row['PM2.5'],
        color = 'red',
        fill = True,
        fill_color = 'red',
        fill_opacity = 1,
        popup = f"PM2.5: {row['PM2.5']} µg/m³"
    ).add_to(nyc_map)
#we can see the map
nyc_map
Discussion of Result:
The map shows the current PM2.5 concentration for New York City. Clicking the red marker displays the value, which was 11.196 µg/m³ when the cell above was run. Note that the value changes over time if the cells are re-run, API calls are rate-limited, and the circle's radius is scaled by the PM2.5 value.
Next we prepare the data: handle missing values by dropping or imputing them, extract features, and develop a predictive model.
Steps:
#here we can see the total null values of ny data columns
data.isnull().sum()
Unique ID             0
Indicator ID          0
Name                  0
Measure               0
Measure Info          0
Geo Type Name         0
Geo Join ID           9
Geo Place Name        9
Time Period           0
Start_Date            0
Data Value            0
Message           18025
dtype: int64
Here we can see that some columns contain missing values: Geo Join ID (9), Geo Place Name (9), and Message (18025).
#Here we are counting the unique values of Message column
data['Message'].value_counts()
Series([], Name: count, dtype: int64)
Since value_counts() returns an empty Series, the Message column contains no data at all, so it is safe to drop it from the dataset.
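Rather than naming the empty column by hand, fully null columns can also be dropped in one call with `dropna(axis=1, how='all')`. A toy sketch (the column names just mimic the NYC data):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the NYC data: 'Message' is entirely null.
df = pd.DataFrame({
    "Data Value": [0.3, 1.2, 8.6],
    "Message": [np.nan, np.nan, np.nan],
})

# how='all' drops only columns in which *every* entry is missing,
# so partially populated columns like Geo Join ID are untouched.
cleaned = df.dropna(axis=1, how="all")
print(list(cleaned.columns))  # → ['Data Value']
```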
#we are dropping the column named Message
data = data.drop(columns = ['Message'])
#we can see the top 5 rows after dropping
data.head()
|   | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | Data Value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 179772 | 640 | Boiler Emissions- Total SO2 Emissions | Number per km2 | number | UHF42 | 409.0 | Southeast Queens | 2015 | 01/01/2015 | 0.3 |
| 1 | 179785 | 640 | Boiler Emissions- Total SO2 Emissions | Number per km2 | number | UHF42 | 209.0 | Bensonhurst - Bay Ridge | 2015 | 01/01/2015 | 1.2 |
| 2 | 178540 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | UHF42 | 209.0 | Bensonhurst - Bay Ridge | Annual Average 2012 | 12/01/2011 | 8.6 |
| 3 | 178561 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | UHF42 | 409.0 | Southeast Queens | Annual Average 2012 | 12/01/2011 | 8.0 |
| 4 | 823217 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | UHF42 | 409.0 | Southeast Queens | Summer 2022 | 06/01/2022 | 6.1 |
#we can see the total null values of ny data columns
data.isnull().sum()
Unique ID         0
Indicator ID      0
Name              0
Measure           0
Measure Info      0
Geo Type Name     0
Geo Join ID       9
Geo Place Name    9
Time Period       0
Start_Date        0
Data Value        0
dtype: int64
There are still some missing values in the Geo Join ID and Geo Place Name columns. Both columns have very few missing values relative to the size of the dataset (9 out of 18,025 rows), so we drop those rows rather than imputing them, which could introduce false information.
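The drop-versus-impute decision can be made explicit by computing the fraction of missing values per column: 9 missing out of 18,025 rows is about 0.05%, far below any common threshold for preferring imputation. A toy sketch (values illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame: one of four rows is missing its Geo Join ID.
df = pd.DataFrame({
    "Geo Join ID": [409.0, np.nan, 209.0, 410.0],
    "Data Value": [0.3, 1.2, 8.6, 0.0],
})

# isnull().mean() gives the fraction of missing values per column,
# which guides the drop-vs-impute decision.
missing_frac = df.isnull().mean()
print(missing_frac["Geo Join ID"])  # → 0.25
```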
#Here we are dropping rows which has Geo join ID or Geo Place Name values are null
data.dropna(subset = ['Geo Join ID', 'Geo Place Name'], inplace = True)
#now we can see that there is no missing data in the dataset of ny
data.isnull().sum()
Unique ID         0
Indicator ID      0
Name              0
Measure           0
Measure Info      0
Geo Type Name     0
Geo Join ID       0
Geo Place Name    0
Time Period       0
Start_Date        0
Data Value        0
dtype: int64
#here we can see the total null values of global data columns
global_data.isnull().sum()
indicatorcode           0
indicator               0
parentlocationcode      0
parentlocation          0
locationtype            0
spatialdimvaluecode     0
location                0
periodtype              0
period                  0
islatestyear            0
dim1type                0
dim1                    0
dim1valuecode           0
factvaluenumeric        0
factvaluenumericlow     0
factvaluenumerichigh    0
value                   0
dtype: int64
There is no missing data in the global PM2.5 concentration dataset.
#extract the year from Time Period (labels that are not parseable dates become NaT)
data['year'] = pd.to_datetime(data['Time Period'], errors = 'coerce').dt.year
#extract the month the same way
data['month'] = pd.to_datetime(data['Time Period'], errors = 'coerce').dt.month
#drop rows whose Time Period could not be parsed, before building the predictive model
data.dropna(inplace = True)
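A caveat worth noting: `pd.to_datetime` coerces labels like "Annual Average 2012" or "Summer 2022" to NaT, so the `dropna` above discards those rows; this is presumably why the row count falls from 18,025 to 1,657 and why `month` is always 1 in the tables below. If keeping seasonal records matters, a four-digit-year regex recovers the year from every label. A sketch:

```python
import pandas as pd

periods = pd.Series(["2015", "Annual Average 2012", "Summer 2022"])

# pd.to_datetime only parses the plain years; seasonal labels become NaT.
parsed = pd.to_datetime(periods, errors="coerce")
print(int(parsed.isna().sum()))  # → 2

# A four-digit-year regex recovers the year from every label.
years = periods.str.extract(r"(\d{4})")[0].astype(float)
print(years.tolist())  # → [2015.0, 2012.0, 2022.0]
```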
#we are splitting the data into independent and dependent variable
X = data[['Indicator ID', 'Geo Type Name', 'Geo Place Name', 'year', 'month']]
y = data['Data Value']
#Here we are encoding the categorical variables
X = pd.get_dummies(X)
#we split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
#making a model of RandomForest
model = RandomForestRegressor()
Here we use a RandomForestRegressor. It is a powerful and flexible ensemble model, widely used for regression because it handles complex datasets, is robust to outliers, and scales well.
#we are training the model
model.fit(X_train, y_train)
RandomForestRegressor()
#Here we are getting predictions
predictions = model.predict(X_test)
#we calculate the mean squared error, which tells us about the model's performance
mse = mean_squared_error(y_test, predictions)
#we can see the mean squared error
mse
74.57973174397588
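Since MSE is in squared units of the target, taking the square root puts the error back on the scale of Data Value (whose standard deviation, per the summary table later in this section, is about 42.5, so an error near 8.6 is well below the spread of the target):

```python
import math

mse = 74.57973174397588  # value reported above
rmse = math.sqrt(mse)    # root mean squared error, in the target's units
print(round(rmse, 2))    # → 8.64
```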
#it plots a scatter plot
#we can see the distribution of actual values and predicted values
plt.figure(figsize = (8, 6))
plt.scatter(y_test, predictions)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted')
plt.show()
Discussion of Result:
The scatter of actual versus predicted values shows a visible relationship between the two, indicating that the random forest captures a substantial part of the signal, although some points deviate noticeably from the diagonal.
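To see which inputs drive the forest's predictions, `feature_importances_` can be inspected after fitting. A self-contained sketch on synthetic data (the feature names are illustrative stand-ins for the encoded NYC columns, not the actual dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the encoded NYC features (names are illustrative).
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "Indicator ID": rng.integers(640, 648, size=200),
    "year": rng.integers(2005, 2020, size=200),
})
# The target is driven almost entirely by one feature.
y = 5.0 * X["Indicator ID"] + rng.normal(0, 1, size=200)

model = RandomForestRegressor(random_state=42).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.idxmax())  # → Indicator ID
```

On the real data, sorting `importances` would show how much the geographic dummies, year, and indicator each contribute.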
We can look at the dataset again after the cleaning and feature-engineering steps.
#here we can see the top 5 rows
data.head()
|   | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | Data Value | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 179772 | 640 | Boiler Emissions- Total SO2 Emissions | Number per km2 | number | UHF42 | 409.0 | Southeast Queens | 2015 | 01/01/2015 | 0.3 | 2015.0 | 1.0 |
| 1 | 179785 | 640 | Boiler Emissions- Total SO2 Emissions | Number per km2 | number | UHF42 | 209.0 | Bensonhurst - Bay Ridge | 2015 | 01/01/2015 | 1.2 | 2015.0 | 1.0 |
| 16 | 130413 | 640 | Boiler Emissions- Total SO2 Emissions | Number per km2 | number | UHF42 | 210.0 | Coney Island - Sheepshead Bay | 2013 | 01/01/2013 | 0.9 | 2013.0 | 1.0 |
| 17 | 130412 | 640 | Boiler Emissions- Total SO2 Emissions | Number per km2 | number | UHF42 | 209.0 | Bensonhurst - Bay Ridge | 2013 | 01/01/2013 | 1.7 | 2013.0 | 1.0 |
| 18 | 130434 | 640 | Boiler Emissions- Total SO2 Emissions | Number per km2 | number | UHF42 | 410.0 | Rockaways | 2013 | 01/01/2013 | 0.0 | 2013.0 | 1.0 |
#we can see the bottom 5 rows
data.tail()
|   | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | Data Value | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18020 | 816914 | 643 | Annual vehicle miles traveled | Million miles | per square mile | CD | 503.0 | Tottenville and Great Kills (CD3) | 2019 | 01/01/2019 | 12.9 | 2019.0 | 1.0 |
| 18021 | 816913 | 643 | Annual vehicle miles traveled | Million miles | per square mile | CD | 503.0 | Tottenville and Great Kills (CD3) | 2010 | 01/01/2010 | 14.7 | 2010.0 | 1.0 |
| 18022 | 816872 | 643 | Annual vehicle miles traveled | Million miles | per square mile | UHF42 | 208.0 | Canarsie - Flatlands | 2010 | 01/01/2010 | 43.4 | 2010.0 | 1.0 |
| 18023 | 816832 | 643 | Annual vehicle miles traveled | Million miles | per square mile | UHF42 | 407.0 | Southwest Queens | 2010 | 01/01/2010 | 65.8 | 2010.0 | 1.0 |
| 18024 | 151658 | 643 | Annual vehicle miles traveled | Million miles | per square mile | UHF42 | 408.0 | Jamaica | 2005 | 01/01/2005 | 41.0 | 2005.0 | 1.0 |
#here we can see the description of the columns
data.describe()
|   | Unique ID | Indicator ID | Geo Join ID | Data Value | year | month |
|---|---|---|---|---|---|---|
| count | 1657.000000 | 1657.000000 | 1657.000000 | 1657.000000 | 1657.000000 | 1657.0 |
| mean | 461609.864816 | 644.091129 | 263.517200 | 32.433736 | 2011.541340 | 1.0 |
| std | 319311.000973 | 1.911578 | 138.287712 | 42.480830 | 4.861327 | 0.0 |
| min | 130397.000000 | 640.000000 | 1.000000 | 0.000000 | 2005.000000 | 1.0 |
| 25% | 154472.000000 | 643.000000 | 201.000000 | 1.900000 | 2005.000000 | 1.0 |
| 50% | 315590.000000 | 644.000000 | 302.000000 | 6.100000 | 2011.000000 | 1.0 |
| 75% | 816928.000000 | 645.000000 | 403.000000 | 56.600000 | 2015.000000 | 1.0 |
| max | 817342.000000 | 647.000000 | 504.000000 | 284.700000 | 2019.000000 | 1.0 |
#here we can see the info of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1657 entries, 0 to 18024
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Unique ID       1657 non-null   int64
 1   Indicator ID    1657 non-null   int64
 2   Name            1657 non-null   object
 3   Measure         1657 non-null   object
 4   Measure Info    1657 non-null   object
 5   Geo Type Name   1657 non-null   object
 6   Geo Join ID     1657 non-null   float64
 7   Geo Place Name  1657 non-null   object
 8   Time Period     1657 non-null   object
 9   Start_Date      1657 non-null   object
 10  Data Value      1657 non-null   float64
 11  year            1657 non-null   float64
 12  month           1657 non-null   float64
dtypes: float64(4), int64(2), object(7)
memory usage: 181.2+ KB
#it plots a scatter plot
#we can see the scatter plot of Indicator ID and Data value
plt.figure(figsize = (12, 6))
plt.scatter(data['Indicator ID'], data['Data Value'])
plt.xlabel('Indicator ID')
plt.ylabel('Data Value')
plt.title('Scatter Plot of Indicator ID vs. Data Value')
plt.show()
#it plots a scatter plot
#we can see the data values over changing years
plt.scatter(data['year'], data['Data Value'])
plt.xlabel('Years')
plt.ylabel('Data Value')
plt.title('Scatter Plot of Years vs. Data Value')
plt.show()
We also evaluated predictive models other than the RandomForestRegressor. The other models produce higher mean squared errors than the RandomForestRegressor does.
#we are splitting the data into independent and dependent variable
X = data[['Indicator ID', 'Geo Type Name', 'Geo Place Name', 'year', 'month']]
y = data['Data Value']
#Here we are encoding the categorical variables
X = pd.get_dummies(X)
#we split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#making a model of Linear Regression
model = LinearRegression()
#we are training the model
model.fit(X_train, y_train)
LinearRegression()
#Here we are getting predictions
predictions = model.predict(X_test)
#we calculate the mean squared error, which tells us about the model's performance
mse = mean_squared_error(y_test, predictions)
#we can see the mean squared error
mse
1337.7076638095348
The mean squared error from the Linear Regression model is much higher (about 1337.7 versus 74.6 for the random forest), so this model did not perform well on the dataset. The RandomForestRegressor gives a noticeably better, more acceptable result.
#it plots a scatter plot
#we can see the distribution of actual values and predicted values
plt.figure(figsize = (8, 6))
plt.scatter(y_test, predictions)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted')
plt.show()
Discussion of Result:
The scatter of actual versus predicted values shows no clear relationship between the two, confirming that the Linear Regression model performed poorly on this dataset.
#we are splitting the data into independent and dependent variable
X = data[['Indicator ID', 'Geo Type Name', 'Geo Place Name', 'year', 'month']]
y = data['Data Value']
#Here we are encoding the categorical variables
X = pd.get_dummies(X)
#we split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#making a model of Support Vector Regression
model = SVR(kernel='rbf')
#we are training the model
model.fit(X_train, y_train)
SVR()
#Here we are getting predictions
predictions = model.predict(X_test)
#we calculate the mean squared error, which tells us about the model's performance
mse = mean_squared_error(y_test, predictions)
#we can see the mean squared error
mse
2094.799139328058
The mean squared error from the Support Vector Regression model is also much higher (about 2094.8), so this model did not perform well on the dataset either. The RandomForestRegressor gives a better, more acceptable result than support vector regression.
#it plots a scatter plot
#we can see the distribution of actual values and predicted values
plt.figure(figsize = (8, 6))
plt.scatter(y_test, predictions)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted')
plt.show()
Discussion of Result:
The scatter of actual versus predicted values again shows no clear relationship between the two, confirming that the Support Vector Regression model also performed poorly on this dataset.
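The three model comparisons above can be collapsed into one loop so every estimator is scored under identical splits. A sketch on synthetic data (note that rankings are data-dependent: on this linear toy problem the linear model wins, whereas the NYC data favored the random forest):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic regression data standing in for the encoded NYC features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "RandomForest": RandomForestRegressor(random_state=42),
    "LinearRegression": LinearRegression(),
    "SVR (rbf)": SVR(kernel="rbf"),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, model.predict(X_test))

# Print models from best (lowest MSE) to worst.
for name, mse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {mse:.1f}")
```

Scoring all candidates in one place also makes it harder to accidentally compare models trained on different splits.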
In conclusion, this project analyzed air quality data to understand temporal and spatial variations in pollution levels. We started by preprocessing the data: converting data types, handling missing values, and engineering features relevant for modeling.
We then developed predictive models using machine learning algorithms, notably the Random Forest Regressor, to predict pollution levels from factors such as geographic location, time period, and other indicators. The model performed well, with a reasonable mean squared error that indicates genuine predictive capability.
In the exploratory data analysis we examined both the New York data and the global data. The New York data covers many pollutants; we focused on PM2.5 because it carries serious health risks, including lung cancer and respiratory disease. We also saw how PM2.5 values change over time, lower on some days and higher on others, and how they vary across the parent locations of different countries.
We also plotted maps using the folium and plotly libraries: one showing New York City's current readings, which update over time (re-running the cell fetches a new value), and one showing global PM2.5 values by country, visible by hovering over a location of interest.
Finally, while we were able to address some of the research questions posed in the proposal, there is still room for improvement and further investigation. Future extensions of this work could incorporate additional data sources, such as meteorological data, traffic patterns, and land use data, to improve the predictive models' accuracy and robustness. Exploring more advanced machine learning algorithms and ensemble techniques could also lead to better predictions and a deeper understanding of air quality dynamics.